Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations
We present novel minibatch stochastic optimization methods for empirical risk
minimization problems. The methods efficiently leverage variance-reduced
first-order and sub-sampled higher-order information to accelerate
convergence. For quadratic objectives, we prove improved iteration
complexity over the state of the art under reasonable assumptions. We also provide
empirical evidence of the advantages of our method compared to existing
approaches in the literature.
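A minimal sketch of a minibatch variance-reduced proximal method in this spirit (plain minibatch prox-SVRG for the lasso, written in NumPy): the sub-sampled higher-order information used by the paper's method is not included, and all names and parameter choices below are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (coordinate-wise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def minibatch_prox_svrg(X, y, lam=0.1, eta=0.1, n_epochs=20, batch=16, seed=0):
    """Minibatch prox-SVRG sketch for min_w (1/2n)||Xw - y||^2 + lam * ||w||_1."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w_tilde = np.zeros(d)
    for _ in range(n_epochs):
        # Full gradient at the reference point (variance-reduction anchor).
        mu = X.T @ (X @ w_tilde - y) / n
        w = w_tilde.copy()
        for _ in range(n // batch):
            idx = rng.choice(n, size=batch, replace=False)
            Xb, yb = X[idx], y[idx]
            # Variance-reduced minibatch gradient.
            g = Xb.T @ (Xb @ w - yb) / batch - Xb.T @ (Xb @ w_tilde - yb) / batch + mu
            # Proximal (soft-thresholding) step.
            w = soft_threshold(w - eta * g, eta * lam)
        w_tilde = w
    return w_tilde

# Toy usage on synthetic sparse-regression data.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 50))
w_true = np.zeros(50); w_true[:5] = 1.0
y = X @ w_true + 0.01 * rng.standard_normal(200)
print(minibatch_prox_svrg(X, y)[:8])
```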
Exploiting Strong Convexity from Data with Primal-Dual First-Order Algorithms
We consider empirical risk minimization of linear predictors with convex loss
functions. Such problems can be reformulated as convex-concave saddle point
problems, and thus are well suited to primal-dual first-order algorithms.
However, primal-dual algorithms often require explicit strongly convex
regularization in order to obtain fast linear convergence, and the required
dual proximal mapping may not admit a closed-form or efficient solution. In this
paper, we develop both batch and randomized primal-dual algorithms that can
exploit strong convexity from data adaptively and are capable of achieving
linear convergence even without regularization. We also present dual-free
variants of the adaptive primal-dual algorithms that do not require computing
the dual proximal mapping, which are especially suitable for logistic
regression.
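To illustrate the saddle-point reformulation this builds on, here is a small sketch of a standard batch primal-dual (Chambolle-Pock style) method for ridge regression with squared loss, where the dual proximal mapping has a closed form; the step sizes, the ridge example, and all names are illustrative assumptions, and the paper's adaptive and dual-free variants differ from this generic baseline.

```python
import numpy as np

def primal_dual_ridge(A, y, lam=0.1, n_iters=2000):
    """Batch primal-dual iterations for min_w (1/2n)||Aw - y||^2 + (lam/2)||w||^2,
    via the saddle point min_w max_alpha (1/n)(alpha^T A w - sum_i phi_i^*(alpha_i)) + (lam/2)||w||^2,
    with phi_i^*(b) = b^2/2 + b*y_i 6for the squared loss."""
    n, d = A.shape
    K = A / n                                  # coupling operator in the saddle point
    L = np.linalg.norm(K, 2)                   # spectral norm of K
    sigma = tau = 0.95 / L                     # step sizes with sigma * tau * L^2 < 1
    w = np.zeros(d); w_bar = w.copy(); alpha = np.zeros(n)
    for _ in range(n_iters):
        # Dual ascent step: closed-form prox of (sigma/n) * phi_i^*(.) for squared loss.
        v = alpha + sigma * (K @ w_bar)
        alpha = (v - (sigma / n) * y) / (1.0 + sigma / n)
        # Primal descent step: prox of tau * (lam/2)||.||^2 is a simple shrinkage.
        w_prev = w
        w = (w - tau * (K.T @ alpha)) / (1.0 + tau * lam)
        # Extrapolation (theta = 1).
        w_bar = 2.0 * w - w_prev
    return w

# Toy usage: compare against the closed-form ridge solution.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20)); y = A @ rng.standard_normal(20)
w_pd = primal_dual_ridge(A, y, lam=0.1)
w_ref = np.linalg.solve(A.T @ A / 100 + 0.1 * np.eye(20), A.T @ y / 100)
print(np.linalg.norm(w_pd - w_ref))
```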
Reducing Runtime by Recycling Samples
Contrary to the situation with stochastic gradient descent, we argue that
when using stochastic methods with variance reduction, such as SDCA, SAG or
SVRG, as well as their variants, it could be beneficial to reuse previously
used samples instead of fresh samples, even when fresh samples are available.
We demonstrate this empirically for SDCA, SAG and SVRG, studying the optimal
sample size one should use, and also uncover behavior that suggests running
SDCA for an integer number of epochs could be wasteful.
Distributed Multitask Learning
We consider the problem of distributed multi-task learning, where each
machine learns a separate, but related, task. Specifically, each machine learns
a linear predictor in high-dimensional space, where all tasks share the same
small support. We present a communication-efficient estimator based on the
debiased lasso and show that it is comparable with the optimal centralized
method.
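A rough sketch of the debias-then-aggregate pattern this describes, under strong simplifying assumptions: each machine fits a lasso locally by ISTA, applies a one-step debiasing correction using a crude plug-in precision-matrix estimate (a stand-in for the node-wise-lasso construction a real debiased lasso would use, and only sensible here because the toy data has more samples than features), and the master aggregates the communicated debiased vectors to recover the shared support.

```python
import numpy as np

def ista_lasso(X, y, lam, n_iters=500):
    """Plain ISTA for min_w (1/2n)||Xw - y||^2 + lam * ||w||_1."""
    n, d = X.shape
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)   # 1 / Lipschitz constant of the smooth part
    w = np.zeros(d)
    for _ in range(n_iters):
        u = w - step * X.T @ (X @ w - y) / n
        w = np.sign(u) * np.maximum(np.abs(u) - step * lam, 0.0)
    return w

def debiased_estimate(X, y, lam):
    """Local lasso fit plus a one-step debiasing correction (crude plug-in Theta)."""
    n, d = X.shape
    w_hat = ista_lasso(X, y, lam)
    Theta = np.linalg.pinv(X.T @ X / n + 1e-3 * np.eye(d))  # stand-in for a node-wise-lasso estimate
    return w_hat + Theta @ X.T @ (y - X @ w_hat) / n

# Toy usage: M machines, tasks share a small support, coefficients differ per task.
rng = np.random.default_rng(0)
M, n, d, lam = 5, 200, 100, 0.1
support = np.arange(5)
debiased = []
for _ in range(M):
    X = rng.standard_normal((n, d))
    w_m = np.zeros(d); w_m[support] = rng.uniform(0.5, 1.5, size=5)
    y = X @ w_m + 0.1 * rng.standard_normal(n)
    debiased.append(debiased_estimate(X, y, lam))   # one d-dimensional vector communicated per machine

row_norms = np.linalg.norm(np.stack(debiased), axis=0)  # feature-wise norm across tasks
print("recovered shared support:", np.sort(np.argsort(row_norms)[-5:]))
```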
Distributed Multi-Task Learning with Shared Representation
We study the problem of distributed multi-task learning with shared
representation, where each machine aims to learn a separate, but related, task
in an unknown shared low-dimensional subspace, i.e., when the predictor matrix
has low rank. We consider a setting where each task is handled by a different
machine, with samples for the task available locally on the machine, and study
communication-efficient methods for exploiting the shared structure.
Distributed Stochastic Multi-Task Learning with Graph Regularization
We propose methods for distributed graph-based multi-task learning that are
based on weighted averaging of messages from other machines. Uniform averaging
or a diminishing stepsize in these methods would yield consensus (single-task)
learning. We show how simply skewing the averaging weights or controlling the
stepsize allows learning different, but related, tasks on the different
machines.
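A small sketch of the weighted-averaging mechanism described above, under illustrative assumptions (linear regression tasks and a fixed row-stochastic weight matrix with a larger self-weight): each machine takes a local gradient step and then averages its parameters with the other machines' messages. Setting the self-weight to 1/M would recover uniform averaging, i.e. consensus learning.

```python
import numpy as np

rng = np.random.default_rng(0)
M, n, d = 4, 100, 20          # machines/tasks, samples per machine, dimension

# Related tasks: a shared component plus a small task-specific perturbation.
w_shared = rng.standard_normal(d)
tasks = []
for _ in range(M):
    X = rng.standard_normal((n, d))
    w_m = w_shared + 0.2 * rng.standard_normal(d)
    tasks.append((X, X @ w_m + 0.1 * rng.standard_normal(n)))

# Row-stochastic averaging weights, skewed toward each machine's own iterate.
self_weight = 0.7
W = np.full((M, M), (1.0 - self_weight) / (M - 1))
np.fill_diagonal(W, self_weight)

eta = 0.05
params = np.zeros((M, d))
for _ in range(300):
    # Local gradient step on each machine's own task.
    grads = np.stack([X.T @ (X @ params[m] - y) / n for m, (X, y) in enumerate(tasks)])
    params = params - eta * grads
    # Weighted averaging of messages from the other machines.
    params = W @ params

for m, (X, y) in enumerate(tasks):
    print(f"machine {m}: train MSE = {np.mean((X @ params[m] - y) ** 2):.4f}")
```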
Efficient coordinate-wise leading eigenvector computation
We develop and analyze efficient "coordinate-wise" methods for finding the
leading eigenvector, where each step involves only a vector-vector product. We
establish global convergence with overall runtime guarantees that are at least
as good as those of the Lanczos method and dominate it for slowly decaying
spectra. Our methods are based on combining a shift-and-invert approach with
coordinate-wise algorithms for linear regression.
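A toy sketch of the shift-and-invert pattern referred to above, with illustrative parameter choices: outer power iterations on (sigma*I - A)^{-1}, where each inner linear solve is done by cyclic coordinate descent so that every coordinate update touches only one row of the matrix (a vector-vector product). The shift below comes from a crude Gershgorin bound rather than the careful shift estimation the paper relies on.

```python
import numpy as np

def coordinate_solve(M, b, x0, sweeps=20):
    """Cyclic coordinate descent on 0.5*x^T M x - b^T x for symmetric positive definite M.
    Each coordinate update uses a single row of M (a vector-vector product)."""
    x = x0.copy()
    for _ in range(sweeps):
        for i in range(len(b)):
            x[i] += (b[i] - M[i] @ x) / M[i, i]
    return x

def leading_eigenvector(A, outer_iters=25):
    """Shift-and-invert power iterations with coordinate-wise inner solves (a sketch)."""
    d = A.shape[0]
    sigma = 1.01 * np.abs(A).sum(axis=1).max()       # Gershgorin upper bound, so sigma > lambda_max
    M = sigma * np.eye(d) - A                         # positive definite shifted matrix
    v = np.random.default_rng(0).standard_normal(d)
    v /= np.linalg.norm(v)
    for _ in range(outer_iters):
        v = coordinate_solve(M, v, v)                 # approximately (sigma*I - A)^{-1} v
        v /= np.linalg.norm(v)
    return v, v @ A @ v                               # eigenvector estimate, Rayleigh quotient

# Toy usage on a matrix with a clear spectral gap; compare with numpy's eigensolver.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))
A = Q @ np.diag(np.concatenate([[10.0, 5.0], np.linspace(0.1, 1.0, 48)])) @ Q.T
v, lam = leading_eigenvector(A)
print(lam, np.linalg.eigvalsh(A)[-1])   # both should be close to 10
```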
Multi-Information Source Optimization
We consider Bayesian optimization of an expensive-to-evaluate black-box
objective function, where we also have access to cheaper approximations of the
objective. In general, such approximations arise in applications such as
reinforcement learning, engineering, and the natural sciences, and are subject
to an inherent, unknown bias. This model discrepancy is caused by an inadequate
internal model that deviates from reality and can vary over the domain, making
the utilization of these approximations a non-trivial task.
We present a novel algorithm that provides a rigorous mathematical treatment
of the uncertainties arising from model discrepancies and noisy observations.
Its optimization decisions rely on a value of information analysis that extends
the Knowledge Gradient factor to the setting of multiple information sources
that vary in cost: each sampling decision maximizes the predicted benefit per
unit cost.
We conduct an experimental evaluation demonstrating that the method
consistently outperforms other state-of-the-art techniques: it finds designs of
considerably higher objective value and additionally incurs less cost in the
exploration process.
Comment: Added benchmark logistic regression on MNIST/USPS, comparison to
MTBO/entropy search, and estimation of hyper-parameters.
Efficient Distributed Learning with Sparsity
We propose a novel, efficient approach for distributed sparse learning in
high dimensions, where observations are randomly partitioned across machines.
Computationally, at each round our method only requires the master machine to
solve a shifted ℓ_1-regularized M-estimation problem, and the other workers to
compute gradients. In terms of communication, the proposed approach
provably matches the estimation error bound of centralized methods within a
constant number of communication rounds (ignoring logarithmic factors). We conduct
extensive experiments on both simulated and real-world datasets, and
demonstrate encouraging performance on high-dimensional regression and
classification tasks.
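A rough sketch of one communication round in this style of method, under illustrative assumptions (least-squares loss, ISTA as the master's inner solver): the workers send their local gradients at the current iterate, and the master minimizes its own local loss plus a gradient-shift term and an ℓ_1 penalty. This is a generic shifted-ℓ_1 round, not the paper's exact estimator, and all names below are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def local_grad(X, y, w):
    """Gradient of the local least-squares loss (1/2n)||Xw - y||^2."""
    return X.T @ (X @ w - y) / X.shape[0]

def master_round(parts, w, lam, n_iters=300):
    """One round: workers communicate gradients at w; the master solves a shifted
    ell_1-regularized problem on its own shard via ISTA."""
    X0, y0 = parts[0]                                   # master's local shard
    grads = [local_grad(X, y, w) for X, y in parts]     # one vector communicated per machine
    shift = np.mean(grads, axis=0) - grads[0]           # global minus local gradient at w
    step = 1.0 / (np.linalg.norm(X0, 2) ** 2 / X0.shape[0])
    v = w.copy()
    for _ in range(n_iters):
        g = local_grad(X0, y0, v) + shift
        v = soft_threshold(v - step * g, step * lam)
    return v

# Toy usage: data randomly partitioned across 4 machines, sparse ground truth.
rng = np.random.default_rng(0)
d, n_per, machines, lam = 100, 150, 4, 0.05
w_true = np.zeros(d); w_true[:5] = 1.0
parts = []
for _ in range(machines):
    X = rng.standard_normal((n_per, d))
    parts.append((X, X @ w_true + 0.1 * rng.standard_normal(n_per)))

w = np.zeros(d)
for r in range(3):                                      # a few communication rounds
    w = master_round(parts, w, lam)
    print(f"round {r + 1}: estimation error = {np.linalg.norm(w - w_true):.3f}")
```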
Gradient Sparsification for Communication-Efficient Distributed Optimization
Modern large-scale machine learning applications require stochastic
optimization algorithms to be implemented on distributed computational
architectures. A key bottleneck is the communication overhead for exchanging
information such as stochastic gradients among different workers. In this
paper, to reduce the communication cost, we propose a convex optimization
formulation to minimize the coding length of stochastic gradients. To solve the
optimal sparsification efficiently, several simple and fast algorithms are
proposed for approximate solution, with theoretical guarantees on sparsity.
Experiments on regularized logistic regression, support vector
machines, and convolutional neural networks validate our sparsification
approaches.
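A minimal sketch of the kind of unbiased gradient sparsification this refers to, under illustrative assumptions: each coordinate is kept with probability proportional to its magnitude (capped at one) and rescaled by the inverse probability, so the sparsified gradient is unbiased in expectation. The paper instead derives the keep-probabilities from a convex coding-length formulation; this proportional rule is only a simple stand-in.

```python
import numpy as np

def sparsify(g, expected_nnz, rng):
    """Unbiased random sparsification of a gradient vector.
    Coordinate i is kept with probability p_i (proportional to |g_i|, capped at 1)
    and rescaled by 1/p_i, so E[sparsified g] = g."""
    p = np.minimum(1.0, expected_nnz * np.abs(g) / np.abs(g).sum())
    keep = rng.random(g.shape) < p
    out = np.zeros_like(g)
    out[keep] = g[keep] / p[keep]
    return out

# Toy usage: check the communication saving and unbiasedness empirically.
rng = np.random.default_rng(0)
g = rng.standard_normal(1000)
samples = np.stack([sparsify(g, expected_nnz=100, rng=rng) for _ in range(2000)])
print("avg nonzeros per message:", (samples != 0).sum(axis=1).mean())
rel_bias = np.linalg.norm(samples.mean(axis=0) - g) / np.linalg.norm(g)
print("relative bias of averaged messages (should be small):", rel_bias)
```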
- …